Practical Issues for Automated Categorization of Web Sites
Abstract
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata and requirements for text features. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categories and present performance results based on different combinations of text features and training data.
Similar resources
Using neighborhood information for automated categorization of Web pages
In this paper we discuss several issues related to the influence of expanding a Web document representation on the quality of topical categorization of Web pages. We consider expanding a Web page by using the text content of its linking pages. We show that naive expansion can pick up too much noise and substantially harm categorization results. We present the approach to automated pruning of linking Web...
Run-time Management Policies for Data Intensive Web sites
Web developers have been concerned with the issues of Web latency and Web data consistency for many years. These issues have become more important in our days since the accurate and imminent dissemination of information is vital to businesses and individuals that rely on the Web. In this paper, we evaluate different run-time management policies against real Web site data. We first define the me...
Image flip CAPTCHA
The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...
Functionality-Based Web Image Categorization
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Identifying the functional categories of these images has important applications including information extraction, web mining, web page summarization and mobile access. This paper describes a study on the functional cate...
Automated Support for Older Adult Accessibility of E-Government Web Sites
The NSF-funded research described in this paper focuses on the development of automated software tools for improving Web accessibility for older adults. The Web offers great promise for immediate access to government information and resources that might not otherwise be available. Yet, there are design and information content barriers to the use of these Web sites making them virtually inaccess...